Fix GPU compilation of Dense(_, _, tanh) on CuArray #141
ChrisRackauckas-Claude wants to merge 2 commits into
The DeepSplitting GPU tests with `Dense(d, hls, tanh)` activations have
been failing on the self-hosted GPU runner since the SciMLBase v3 /
CUDA 6 / NNlib 0.9.3x bump:
InvalidIRError: compiling MethodInstance for gpu_broadcast_kernel_cartesian
Reason: unsupported dynamic function invocation
(call to var"#_#103"(kw::Base.Pairs{...}, c::ComposedFunction, x...)
@ Base operators.jl:1041)
Inside `NNlib.bias_act!`, the activation gets wrapped as a
`ComposedFunction(fast_act(σ, x), +)` and the result is broadcast on
GPU. For `σ = tanh`, `fast_act` substitutes `tanh_fast` (a polynomial
approximation), and the resulting `ComposedFunction{tanh_fast, +}`
broadcast kernel hits a dynamic dispatch on the `ComposedFunction`
kwsorter that the GPU compiler rejects. The same construction with the
device intrinsic `tanh` compiles cleanly — verified by Carlo Lucibello
on Metal in FluxML/Flux.jl#2633.
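For readers unfamiliar with `ComposedFunction`, the CPU-only sketch below shows the shape of the callable that gets broadcast. It uses only Base Julia and does not reproduce NNlib's `fast_act` swap or the GPU failure; the arrays `x` and `b` and the name `act_after_add` are illustrative assumptions.

```julia
# CPU-only illustration using Base Julia; it does not touch NNlib or a GPU.
# `tanh ∘ (+)` builds the kind of object described above: a ComposedFunction
# whose inner function adds the bias and whose outer function applies the
# activation.
act_after_add = tanh ∘ (+)

# The composed callable stores both pieces as fields.
@assert act_after_add isa ComposedFunction
@assert act_after_add.outer === tanh
@assert act_after_add.inner === +

# Calling it with two arguments computes tanh(x + b).
@assert act_after_add(0.25, 0.5) == tanh(0.75)

# Broadcasting it over a matrix and a bias vector mirrors the fused
# "add bias, then activate" step (illustrative shapes: 3×2 and 3).
x = [0.1 0.2; 0.3 0.4; 0.5 0.6]
b = [1.0, -1.0, 0.0]
y = broadcast(act_after_add, x, b)
@assert y ≈ tanh.(x .+ b)
```

On the CPU this composes and broadcasts without issue; the failure described above only appears when the outer function is `tanh_fast` and the broadcast is compiled for the GPU.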
NNlib already exposes a per-array-type opt-out for exactly this case
(see `NNlib.fast_act` in NNlib/src/activations.jl:897-906). Add the
CuArray override so `Dense(_, _, tanh)` falls back to `Base.tanh` on
the GPU. NNlib is added as a direct dep so the override is unambiguous
rather than relying on Flux's transitive load.
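The opt-out is a one-method dispatch trick. The sketch below mimics it in plain Julia so it runs without CUDA or NNlib; `ToyDeviceArray`, `choose_act`, and `approx_tanh` are hypothetical stand-ins, not NNlib names. The actual method from this PR is quoted in a comment.

```julia
# Hypothetical stand-ins for illustration only; none of these names are NNlib's.

# A toy "device" array type wrapping a plain Array.
struct ToyDeviceArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::ToyDeviceArray) = size(a.data)
Base.getindex(a::ToyDeviceArray, i::Int...) = a.data[i...]

# Stand-in for a fast activation approximation (a simple rational formula).
approx_tanh(x) = x * (15 + x^2) / (15 + 6x^2)

# Generic rule: keep the activation unchanged.
choose_act(f, ::AbstractArray) = f
# Fast path: swap tanh for the approximation on any array...
choose_act(::typeof(tanh), ::AbstractArray) = approx_tanh
# ...except on the device array type, which opts back out to Base.tanh.
choose_act(::typeof(tanh), ::ToyDeviceArray) = tanh

# The corresponding method in this PR is:
#   NNlib.fast_act(::typeof(tanh), ::CuArray) = tanh

cpu = [0.1, 0.2, 0.3]
dev = ToyDeviceArray([0.1, 0.2, 0.3])

@assert choose_act(tanh, cpu) === approx_tanh  # CPU keeps the fast path
@assert choose_act(tanh, dev) === tanh         # device array falls back
@assert choose_act(abs, dev) === abs           # other activations untouched
```

Because the more specific method wins dispatch, only `tanh` on the device array type changes behavior; every other activation and array type keeps its existing path.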
This restores the 4 previously-failing DeepSplitting GPU tests
(`allen cahn`, `Black-Scholes Equation with Default Risk`,
`replicator mutator`, `allen cahn non local - Neumann BC`) — the same
set that was failing on `main` since PR SciML#137.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Tracking down the upstream story
Metal already has the sister bug (fixed in NNlib v0.9.32)
The same failure is reported on Metal:
```julia
using NNlib, Metal
W, x, b = MtlArray(rand(Float32, 3, 2)), MtlArray(rand(Float32, 2, 2)), MtlArray(rand(Float32, 3))
NNlib._fast_broadcast!(tanh ∘ (+), W*x, b) # ✅ OK
NNlib._fast_broadcast!(NNlib.tanh_fast ∘ (+), W*x, b) # ❌ InvalidIRError
```
@christiangnrd narrowed it further: removing `@device_override NNlib.tanh_fast(x) = Base.FastMath.tanh_fast(x)`
CUDA has the same bug, but no upstream issue or extension fix
I searched FluxML/NNlib.jl, JuliaGPU/CUDA.jl, JuliaGPU/GPUCompiler.jl,
So we are most likely the first to trip over this on CUDA. The reason
Reduced CUDA MWE (would need GPU hardware)
```julia
using CUDA, NNlib
a = CUDA.rand(Float32, 5)
b = CUDA.rand(Float32, 5)
# ✅ OK: Base.tanh, GPU compiler uses the device intrinsic
broadcast(tanh ∘ (+), a, b)
# ❌ InvalidIRError on `var"#_#103"(kw::Base.Pairs, c::ComposedFunction, x...)`
broadcast(NNlib.tanh_fast ∘ (+), a, b)
```
Equivalently, what the failing tests actually trigger:
```julia
using CUDA, Flux
m = Dense(2 => 3, tanh) |> Flux.gpu
x = CUDA.rand(Float32, 2, 4)
m(x) # forward fails with InvalidIRError
```
Why
The intentional `NNlib.fast_act(::typeof(tanh), ::CuArray) = tanh` opt-out from the previous commit is type piracy by design; that's exactly the shape NNlib's per-array-type `fast_act` API was built for. Tell `Aqua.test_piracies` to ignore methods we add to `NNlib.fast_act`.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
GPU run on
That's intentional: we're adding a method to NNlib's
Summary
Fixes the four `DeepSplitting` GPU tests that have been failing on the self-hosted runner since the SciMLBase v3 / CUDA 6 / NNlib 0.9.3x bump:
- `DeepSplitting algorithm - allen cahn`
- `DeepSplitting - Black-Scholes Equation with Default Risk`
- `DeepSplitting - replicator mutator`
- `DeepSplitting algorithm - allen cahn non local - Neumann BC`
(Same set that was failing on `main` since PR #137 / commit `6fc04330`.)
Root cause
All four tests build `Dense(d, hls, tanh)` layers and run them on `CuArray` input. Inside `NNlib.bias_act!`, the activation is wrapped into a `ComposedFunction(fast_act(σ, x), +)` and broadcast on GPU. For `σ = tanh`, `fast_act` substitutes `tanh_fast` (a polynomial approximation built on `evalpoly`); the resulting `ComposedFunction{tanh_fast, +}` broadcast kernel then hits a dynamic dispatch on the `ComposedFunction` kwsorter that the CUDA compiler rejects with an `InvalidIRError`.
The same construction with the device intrinsic `tanh` (the original Base method) compiles cleanly; Carlo Lucibello reduced exactly this in FluxML/Flux.jl#2633 on Metal.
Fix
NNlib already exposes a per-array-type opt-out for exactly this case via `NNlib.fast_act(σ, x)` (see activations.jl:897-906). Add the CuArray override in HighDimPDE so `Dense(_, _, tanh)` falls back to `Base.tanh` on the GPU. NNlib is added as a direct dep so the override is unambiguous rather than relying on Flux's transitive load.
This is the same opt-out @mcabbott proposed for Metal; the upstream Metal PR (NNlib#666) used `@device_override` instead but was closed unmerged. Until NNlib lands a proper CUDA-side fix, this in-package override is the smallest change that gets the GPU tests green here.
Test plan
`allen cahn`, `Black-Scholes Default Risk`, `replicator mutator`, `allen cahn non local - Neumann BC` go from `errored` to `passed`.
`::CuArray`).
`relu` (which already passed).
🤖 Generated with Claude Code
This PR should be ignored until reviewed by @ChrisRackauckas.